Vectorisation #589

Closed. sv2518 wants to merge 17 commits.

Conversation

@sv2518 (Contributor) commented Jun 27, 2020

I would like to merge TJ's changes to enable vectorisation of TSFC kernels.

This branch depends on https://github.com/firedrakeproject/loopy/tree/cvec.

#590 in PyOP2 (+ https://github.com/firedrakeproject/firedrake/tree/wence/fix/interpolate-max in firedrake) should be merged first, so I can drop those commits in this branch.

@sv2518 requested review from tj-sun and wence- on June 27, 2020 18:37
@tj-sun (Contributor) left a comment

I'm quite out of sync with the changes made to PyOP2 and Firedrake over the last 18 months or so, so I'm probably not qualified to review this. However, I did take a look and added some comments.

I remember that this (or a similar) branch did pass all tests at one point in time (I can't quite recall whether that was locally on my Linux machine or on Jenkins), but it is very likely that subsequent changes and/or new examples introduce new issues.

One major feature that I didn't touch at all back then is detecting the hardware/compiler environment and applying vectorisation accordingly. This is especially important if you want to switch on vectorisation by default. As a first step to get things going, I guess we could have users opt in rather than turning this on by default, but in that case we should at least have some Jenkins jobs that test vectorisation.
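
For the opt-in route, a rough sketch of gating the transformation behind a configuration flag might look as follows (the "vectorize" key and apply_vectorisation are illustrative names, not things defined in this PR):

from pyop2.configuration import configuration

def maybe_vectorise(wrapper_kernel):
    # Sketch: only apply the vectorisation transform when the user has
    # explicitly opted in; otherwise leave the wrapper kernel untouched.
    if not configuration.get("vectorize", False):
        return wrapper_kernel
    return apply_vectorisation(wrapper_kernel)  # hypothetical transform entry point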

Inline review comments (now resolved, mostly outdated) on pyop2/sequential.py, pyop2/configuration.py, pyop2/codegen/rep2loopy.py and pyop2/compilation.py.
@sv2518 (Contributor, Author) commented Jul 1, 2020

The Firedrake test test_real_space.py::test_real_interpolate is failing with the following generated code, and the result is 4 times what we expect. It seems like there is another bug in the splitting. I think the iname should not be split at all, since it only has extent 1.

---------------------------------------------------------------------------
KERNEL: wrap_expression_kernel
---------------------------------------------------------------------------
ARGUMENTS:
end: ValueArg, type: np:dtype('int32')
glob0: type: np:dtype('float64'), shape: (1), dim_tags: (N0:stride:1) aspace: global
glob1: type: np:dtype('float64'), shape: (1), dim_tags: (N0:stride:1) aspace: global
start: ValueArg, type: np:dtype('int32')
---------------------------------------------------------------------------
DOMAINS:
[end, start] -> { [n_outer, n_batch] : n_batch >= start - 4n_outer and 0 <= n_batch <= 3 and n_batch < end - 4n_outer }
{ [expr_p0] : expr_p0 = 0 }
---------------------------------------------------------------------------
INAME IMPLEMENTATION TAGS:
expr_p0: None
n_batch: c_vec
n_outer: None
---------------------------------------------------------------------------
TEMPORARIES:
_zeros: type: np:dtype('float64'), shape: () scope:global
---------------------------------------------------------------------------
INSTRUCTIONS:
    for n_outer, n_batch
↱       glob0[0] = 0  {id=statement1}
├↱      ... nop  {id=expr__start}
││      for expr_p0
└└↱       glob0[expr_p0] = glob0[expr_p0] + glob1[0]  {id=expr_insn}
  │     end expr_p0
  └     ... nop  {id=statement0}
    end n_outer, n_batch
---------------------------------------------------------------------------

@wence- (Member) commented Jul 1, 2020

Can you dump the wrapper kernel before this splitting transform so we can have a look at what went wrong? Do loopy.dump_as_python(kernel) to get something that can be used to construct it independently.

@sv2518 (Contributor, Author) commented Jul 1, 2020

import loopy as lp
import numpy as np

knl = lp.make_kernel(
    [
    "[end, start] -> { [n] : start <= n < end }",
    "{ [expr_p0] : expr_p0 = 0 }",
    ],
    """
    glob0[0] = 0 {id=statement1, inames=n}
    ... nop {id=expr__start, dep=statement1, inames=n}
    glob0[expr_p0] = glob0[expr_p0] + glob1[0] {id=expr_insn, dep=expr__start:statement1, inames=n:expr_p0}
    ... nop {id=statement0, dep=expr_insn, inames=n}
    """, [
        lp.ValueArg(
            name='start', dtype=np.int32),
        lp.ValueArg(
            name='end', dtype=np.int32),
        lp.GlobalArg(
            name='glob0', dtype=np.float64,
            shape=(1,), for_atomic=False),
        lp.GlobalArg(
            name='glob1', dtype=np.float64,
            shape=(1,), for_atomic=False),
        lp.TemporaryVariable(
            name='_zeros', dtype=np.float64,
            shape=(), for_atomic=False,
            address_space=lp.AddressSpace.GLOBAL,
            read_only=True,
            initializer=np.array(0.),
            ),
        ], lang_version=(2018, 2))
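
For reference, the problematic transformation can then be reproduced from this dump by splitting the n iname the way the wrapper does and printing the generated code (a sketch; the c_vec tag needs the firedrakeproject/loopy cvec branch this PR depends on):

# Split the iname as the PyOP2 wrapper does, then inspect the generated C.
knl = lp.split_iname(knl, "n", 4, inner_tag="c_vec")
print(lp.generate_code_v2(knl).device_code())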

@wence- (Member) commented Jul 1, 2020

The loop splitting code doesn't seem to privatise the reduction variables properly for global reductions.

We start with (I've assumed start = 0, end mod 4 = 0 and end > start to remove some conditionals):

void loopy_kernel(int32_t const start, int32_t const end,
                  double *__restrict__ glob0,
                  double const *__restrict__ glob1) {
  for (int32_t n = 0; n <= -1 + end; ++n) {
    glob0[0] = 0.0;
    {
      int32_t const expr_p0 = 0;

      glob0[0] = glob0[0] + glob1[0];
    }
  }
}

So far, so good.

If I apply lp.split_iname(knl, "n", 4, inner_tag="c_vec"), then after code generation I get:

void loopy_kernel(int32_t const start, int32_t const end,
                  double *__restrict__ glob0,
                  double const *__restrict__ glob1) {
  for (int32_t n_outer = 0; n_outer <= (-4 + end) / 4; ++n_outer) {
    for (int32_t n_inner = 0; n_inner <= 3; ++n_inner) {
      glob0[0] = 0.0;
    }
    {
      int32_t const expr_p0 = 0;

      for (int32_t n_inner = 0; n_inner <= 3; ++n_inner)
        glob0[0] = glob0[0] + glob1[0];
    }
  }
}

Notice that this is wrong: we zero glob0[0] four times and then sum into it four times, rather than zeroing and summing together on every loop iteration. This occurs, I think, because loopy doesn't know that the zeroing of glob0 really belongs in the same loop nest as the summation (the former is created by the PyOP2 wrapper generation and the latter by TSFC, so they have different effective inames).

I wonder what one should have done.
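
For comparison, here is a Python sketch (under the same assumptions: start = 0, end a multiple of 4) of what a correctly privatised version would compute; the acc temporaries are hypothetical, and this is not what loopy currently emits:

def privatised(start, end, glob1):
    # start = 0 and end % 4 == 0 assumed, as in the C listings above.
    glob0 = 0.0
    for n_outer in range(end // 4):
        acc = [0.0] * 4               # per-lane private accumulators
        for n_inner in range(4):      # zeroing and summing stay paired per iteration
            acc[n_inner] = acc[n_inner] + glob1
        for n_inner in range(4):
            glob0 = acc[n_inner]      # write-back reproduces the unsplit result
    return glob0                      # equals glob1 for end > start, as in the original loop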

@sv2518 (Contributor, Author) commented Jul 1, 2020

I don't know. Which instruction belongs to which loop is usually defined via within_inames. Maybe we forgot to specify that somewhere, but since the instruction is in the loop body, I assume that is not it. Maybe the fact that it's wrapped in a Block is the problem.

@wence- (Member) commented Jul 1, 2020

I don't know. Which instruction belongs to which loop is usually defined via within_inames. Maybe we forgot to specify that somewhere, but since the instruction is in the loop body, I assume that is not it. Maybe the fact that it's wrapped in a Block is the problem.

The problem here is that the expr_p0 iname comes from tsfc, whereas the glob0[0] = 0 assignment comes from the wrapper. If they had the same iname it would be OK, but we can't make that happen. I reported a loopy issue here: https://gitlab.tiker.net/inducer/loopy/-/issues/217

@wence- (Member) commented Jul 1, 2020

A workaround is to add yet another condition to the check that switches vectorisation off again; see https://github.com/OP2/PyOP2/pull/589/files#diff-c87324f585685c592b11a1e78337d775R183

@sv2518 (Contributor, Author) commented Jul 1, 2020

Do you expect this to occur only when writing into globals? I added the following condition and was hoping the other failing test would pass as well, but no:

has_wg = any(isinstance(arg.data, Global) and arg.access == WRITE for arg in self._args)
if isinstance(self._kernel.code, loopy.LoopKernel) and not (has_matrix or has_rw or has_wg):

@wence- (Member) commented Jul 1, 2020

No clue, but I'm not sure that it should only trigger when the arg is WRITE; I suspect it should trigger whenever the arg is not READ.
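
For instance, the condition from the earlier snippet might become (a sketch, reusing those names):

has_wg = any(isinstance(arg.data, Global) and arg.access != READ for arg in self._args)
if isinstance(self._kernel.code, loopy.LoopKernel) and not (has_matrix or has_rw or has_wg):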

@wence- (Member) commented Jul 1, 2020

None of these workarounds is a very satisfactory solution: it feels like the implementations of the cvec and omp_simd backends in loopy are not very complete. But I guess we should get as much code merged as possible and then fix that.

@sv2518 (Contributor, Author) commented Jul 1, 2020

Yes to both comments.

The line smoother test is still failing, but only the one for the periodic hierarchy. The solver diverges at the end. I have no idea how to debug this.

@wence- (Member) commented Jul 1, 2020

The line smoother test is still failing, but only the one for the periodic hierarchy. The solver diverges at the end. I have no idea how to debug this.

  1. Check that the assembly of the operator is correct.
  2. Check the residual (see the sketch after this list).
  3. Can you get the failure on a mesh with a single column? (It will take a different number of iterations from what the test asserts.)
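
For point 2, a minimal residual check could look like this (a sketch; a, L and uh stand for whatever bilinear form, right-hand side and computed solution the failing test uses):

from firedrake import assemble, action

def residual_norm(a, L, uh):
    # l2 norm of the algebraic residual of a(uh, .) - L for bilinear form a,
    # linear form L and computed solution uh.
    r = assemble(action(a, uh) - L)  # assembled 1-form holding the residual
    return r.dat.norm                # PyOP2 Dat exposes the vector 2-norm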

@wence- (Member) commented Jul 2, 2020

A workaround is to add yet another condition to the check that switches vectorisation off again; see

Here's a proposed better fix. The loopy folks suggest privatising the reduction via a temporary variable (see the discussion here: https://gitlab.tiker.net/inducer/loopy/-/issues/217).

We can do this by producing code with temporaries at codegen time. I pushed 0677a24, which does this and hopefully fixes the problem (without introducing new ones).

@wence- (Member) commented Jul 2, 2020

OK, that was a bad fix; I will revert it for now.

@sv2518 (Contributor, Author) commented Jul 3, 2020

What went wrong with your fix, @wence-?

@wence- (Member) commented Jul 3, 2020

Oh, it revealed firedrakeproject/firedrake#1768.

@sv2518 force-pushed the vectorisation branch 3 times, most recently from 3d47231 to fdf3fad on July 6, 2020 10:32
@sv2518 (Contributor, Author) commented Jul 6, 2020

The kernel that causes the failure in tests/regression/test_line_smoother_periodic.py::test_line_smoother_periodic is attached in vectorised and unvectorised form. The bug is in the value of n_outer in the initial slab.

Archive.zip

@wence- (Member) commented Jul 7, 2020

Can you show what values the start and end parameters have that result in bad values for n_outer in the initial slab?

Translating that loop structure to Python, I get:

def single(start, end):
    for i in range(start, end):
        print(i)


def fdivpos(a, b):
    if a < 0:
        a = a - (b - 1)
    return a // b


def unroll(start, end):
    n_outer = start - ((3 + 3*start) // 4)
    print("initial")
    if end - start - 1 >= 0:
        for b in range(start - 4*n_outer, 4 if end - 4*(n_outer+1) >= 0 else end - 4*n_outer):
            print(4*n_outer + b)
    print("bulk")
    for n in range(1 + start - ((3 + 3*start)//4), (3+end)//4 - 1):
        for b in range(4):
            print(4*n + b)

    print("final")
    n_outer = fdivpos(end - 1, 4)
    if 4*n_outer - start - 1 >= 0:
        for b in range(0, end - 4*n_outer):
            print(4*n_outer + b)

It's not obvious that this gives wrong iteration counts.

@sv2518 (Contributor, Author) commented Jul 7, 2020

I think the reason you cannot see the problem in the iteration counts in your script is that the slab n_outer in the C kernel is rounded differently. To reproduce the problem you would need to replace n_outer = start - ((3 + 3*start) // 4) with math.floor(start - ((3 + 3*start) / 4)). In that case the iterations over the initial slab go missing.
The start and end sets in the test are: (0, 18), (18, 18), (0, 72), (72, 72).
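
For concreteness, with start = 18 the two roundings differ (plain Python, just the arithmetic):

import math

start = 18
print(start - (3 + 3 * start) // 4)             # 18 - 57 // 4 = 18 - 14 = 4
print(math.floor(start - (3 + 3 * start) / 4))  # floor(18 - 14.25) = 3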

@sv2518 (Contributor, Author) commented Jul 7, 2020

So I think in the C kernel we want int const n_outer = start - loopy_floor_div_pos_b_int32((3 + 3 * start), 4);

@wence- (Member) commented Jul 8, 2020

I think the reason you cannot see the problem in the iteration counts in your script is that the slab n_outer in the C kernel is rounded differently. To reproduce the problem you would need to replace n_outer = start - ((3 + 3*start) // 4) with math.floor(start - ((3 + 3*start) / 4)). In that case the iterations over the initial slab go missing.
The start and end sets in the test are: (0, 18), (18, 18), (0, 72), (72, 72).

I'm pretty sure Python and C ints have the same rounding behaviour.

I did a bit more digging, and it turns out this is not a bug in the loopy codegen, but rather in the values that we're sending into the kernel.

You can reproduce it by comparing the output of this code with vectorisation off and on:

from firedrake import *


def periodise(m):
    coord_fs = VectorFunctionSpace(m, "DG", 1, dim=2)
    old_coordinates = m.coordinates
    new_coordinates = Function(coord_fs)
    domain = "{[i, j]: 0 <= i < old_coords.dofs and 0 <= j < new_coords.dofs}"
    instructions = """
    <float64> Y = 0
    <float64> pi = 3.141592653589793
    for i
        Y = Y + old_coords[i, 1]
    end
    for j
        new_coords[j, 0] = atan2(old_coords[j, 1], old_coords[j, 0]) / (pi* 2)
        new_coords[j, 0] = new_coords[j, 0] + (1 if new_coords[j, 0] < 0 else 0)
        new_coords[j, 0] = 1 if (new_coords[j, 0] == 0 and Y < 0) else new_coords[j, 0]
        new_coords[j, 0] = new_coords[j, 0] * Lx[0]
        new_coords[j, 1] = old_coords[j, 2] * Ly[0]
    end
    """
    cLx = Constant(1)
    cLy = Constant(1)
    par_loop((domain, instructions), dx,
             {"new_coords": (new_coordinates, WRITE),
              "old_coords": (old_coordinates, READ),
              "Lx": (cLx, READ),
              "Ly": (cLy, READ)},
             is_loopy_kernel=True)
    return Mesh(new_coordinates, reorder=False)

H = 1
N = 3

mesh = periodise(CylinderMesh(N, N, 1.0, H, reorder=False))

V = FunctionSpace(mesh, "P", 1)
x, _ = SpatialCoordinate(mesh)
xc = interpolate(x, V)

print(xc.dat.data_ro)

# "working"
=> [0.66666667 1.         0.66666667 0.33333333 0.33333333 0.
 0.         0.66666667 0.33333333 0.66666667 0.33333333 0.        ]

# "broken"
=> [0.66666667 1.         0.66666667 0.33333333 0.33333333 1.
 0.         0.66666667 0.33333333 0.66666667 0.33333333 0.        ]

What's happening here is that we're interpolating a DG field into a CG space, and the values we get are dependent on the order in which we execute things. Now, PyOP2 has never guaranteed that it won't change the iteration order, so arguably this code has been relying on unspecified behaviour.

Actually fixing this properly needs a big rethink of the way the multigrid grid transfer is done. However, I can at least get the test to pass by applying this patch in loopy:

diff --git a/loopy/codegen/loop.py b/loopy/codegen/loop.py
index c71554a4..43bff2c3 100644
--- a/loopy/codegen/loop.py
+++ b/loopy/codegen/loop.py
@@ -105,9 +105,9 @@ def get_slab_decomposition(kernel, iname):
         if upper_bulk_bound is not None:
             bulk_slab = bulk_slab.add_constraint(upper_bulk_bound)
 
-        slabs.append(("bulk", bulk_slab))
         if lower_slab:
             slabs.append(lower_slab)
+        slabs.append(("bulk", bulk_slab))
         if upper_slab:
             slabs.append(upper_slab)

@sv2518 (Contributor, Author) commented Jul 8, 2020

If I execute that test, I get different output:

"broken"
[0.         0.33333333 0.         0.33333333 0.66666667 0.66666667
 1.         0.33333333 0.66666667 1.         0.33333333 0.66666667]
"working, including your temporary fix":
[1.         0.33333333 1.         0.33333333 0.66666667 0.66666667
 1.         0.33333333 0.66666667 1.         0.33333333 0.66666667]

@sv2518 force-pushed the vectorisation branch 3 times, most recently from 9d6e625 to 6cda3c1 on July 20, 2020 12:26
@sv2518 (Contributor, Author) commented Jul 21, 2020

The last failing Firedrake test is now tests/extrusion/test_real_tensorproduct.py.

@sv2518 (Contributor, Author) commented Jul 22, 2020

#590 in PyOP2 (+ https://github.com/firedrakeproject/firedrake/tree/wence/fix/interpolate-max in firedrake) should be merged first, so I can drop those commits in this branch.

@wence- (Member) commented Jul 22, 2020

#590 in PyOP2 (+ https://github.com/firedrakeproject/firedrake/tree/wence/fix/interpolate-max in firedrake) should be merged first, so I can drop those commits in this branch.

Done.

@sv2518 force-pushed the vectorisation branch 2 times, most recently from 6420e9e to 856b6aa on July 22, 2020 15:16
@sv2518 marked this pull request as ready for review on July 22, 2020 15:17
@sv2518 (Contributor, Author) commented Jul 22, 2020

I have py-cpuinfo as a required dependency. I added it in Firedrake (https://github.com/firedrakeproject/firedrake/compare/trigger-tests-for-vec) and also in PyOP2 in commit 1cf7698. Do I need both, or is adding it in PyOP2 enough?
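
As background, the kind of hardware check py-cpuinfo enables looks roughly like this (a sketch; the list of instruction sets is illustrative):

import cpuinfo  # provided by the py-cpuinfo package

def supported_simd_extensions():
    # SIMD instruction sets advertised by the host CPU, most capable first.
    flags = cpuinfo.get_cpu_info().get("flags", [])
    return [isa for isa in ("avx512f", "avx2", "avx", "sse4_2") if isa in flags]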

@sv2518 (Contributor, Author) commented Jul 7, 2022

Superseded by #654

@sv2518 closed this on Jul 7, 2022